Diamonds Price Prediction¶

In this notebook I will use machine learning techniques to predict diamond prices.

Steps:¶

i)Understanding The Data¶

In [1]:
#importing the libraries 
import pandas as pd
import numpy as np
color_mix=['#FA8072','#DC143C','#8B0000','#FFA500','#FF4500']
In [2]:
# importing the data and analyse it 
diamonds = pd.read_csv('diamonds.csv')
diamonds.head(15)
Out[2]:
Unnamed: 0 carat cut color clarity depth table price x y z
0 1 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 2 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 3 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 4 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 5 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
5 6 0.24 Very Good J VVS2 62.8 57.0 336 3.94 3.96 2.48
6 7 0.24 Very Good I VVS1 62.3 57.0 336 3.95 3.98 2.47
7 8 0.26 Very Good H SI1 61.9 55.0 337 4.07 4.11 2.53
8 9 0.22 Fair E VS2 65.1 61.0 337 3.87 3.78 2.49
9 10 0.23 Very Good H VS1 59.4 61.0 338 4.00 4.05 2.39
10 11 0.30 Good J SI1 64.0 55.0 339 4.25 4.28 2.73
11 12 0.23 Ideal J VS1 62.8 56.0 340 3.93 3.90 2.46
12 13 0.22 Premium F SI1 60.4 61.0 342 3.88 3.84 2.33
13 14 0.31 Ideal J SI2 62.2 54.0 344 4.35 4.37 2.71
14 15 0.20 Premium E SI2 60.2 62.0 345 3.79 3.75 2.27
In [3]:
# inspect the last rows of the data
diamonds.tail(15)
Out[3]:
Unnamed: 0 carat cut color clarity depth table price x y z
53925 53926 0.79 Ideal I SI1 61.6 56.0 2756 5.95 5.97 3.67
53926 53927 0.71 Ideal E SI1 61.9 56.0 2756 5.71 5.73 3.54
53927 53928 0.79 Good F SI1 58.1 59.0 2756 6.06 6.13 3.54
53928 53929 0.79 Premium E SI2 61.4 58.0 2756 6.03 5.96 3.68
53929 53930 0.71 Ideal G VS1 61.4 56.0 2756 5.76 5.73 3.53
53930 53931 0.71 Premium E SI1 60.5 55.0 2756 5.79 5.74 3.49
53931 53932 0.71 Premium F SI1 59.8 62.0 2756 5.74 5.73 3.43
53932 53933 0.70 Very Good E VS2 60.5 59.0 2757 5.71 5.76 3.47
53933 53934 0.70 Very Good E VS2 61.2 59.0 2757 5.69 5.72 3.49
53934 53935 0.72 Premium D SI1 62.7 59.0 2757 5.69 5.73 3.58
53935 53936 0.72 Ideal D SI1 60.8 57.0 2757 5.75 5.76 3.50
53936 53937 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61
53937 53938 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56
53938 53939 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74
53939 53940 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64
In [4]:
diamonds['carat'].min()
Out[4]:
0.2
In [5]:
diamonds['carat'].max()
Out[5]:
5.01
In [6]:
diamonds['depth'].min()
Out[6]:
43.0
In [7]:
diamonds['depth'].max()
Out[7]:
79.0
In [8]:
diamonds['table'].min()
Out[8]:
43.0
In [9]:
diamonds['table'].max()
Out[9]:
95.0
In [10]:
diamonds['price'].min()
Out[10]:
326
In [11]:
diamonds['price'].max()
Out[11]:
18823
In [12]:
diamonds['x'].min()
Out[12]:
0.0
In [13]:
diamonds['x'].max()
Out[13]:
10.74
In [14]:
diamonds['y'].min()
Out[14]:
0.0
In [15]:
diamonds['y'].max()
Out[15]:
58.9
In [16]:
diamonds['z'].min()
Out[16]:
0.0
In [17]:
diamonds['z'].max()
Out[17]:
31.8

Description Of The Dataset¶

Carat: the weight of the diamond; ranges from 0.2 to 5.01 in this dataset.

Cut: the quality of the diamond's cut, with five grades (Fair, Good, Very Good, Premium, Ideal).

Color: the diamond's color, from J (worst) to D (best).

Clarity: how flawless the diamond is.

Depth: the diamond's depth, measured from top to bottom; ranges from 43 to 79 in this dataset.

Table: the flat facet of the diamond seen when the stone is face up; ranges from 43 to 95 in this dataset.

Price: price in US dollars; ranges from 326 to 18,823.

X: length in mm; ranges from 0 to 10.74.

Y: width in mm; ranges from 0 to 58.9.

Z: depth in mm; ranges from 0 to 31.8.
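The per-column ranges listed above can be gathered in one call instead of a separate `.min()`/`.max()` cell per column; a small sketch on a stand-in frame (illustrative values only, the real notebook loads `diamonds.csv`):

```python
import pandas as pd

# Stand-in for the diamonds DataFrame (illustrative values only)
diamonds = pd.DataFrame({
    "carat": [0.20, 1.10, 5.01],
    "depth": [43.0, 61.8, 79.0],
    "price": [326, 2401, 18823],
})

# One agg call summarizes every numeric column at once
ranges = diamonds.agg(["min", "max"])
print(ranges)
```
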

Data Analysis¶

In [18]:
print(diamonds.shape)
(53940, 11)
In [19]:
diamonds.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  53940 non-null  int64  
 1   carat       53940 non-null  float64
 2   cut         53940 non-null  object 
 3   color       53940 non-null  object 
 4   clarity     53940 non-null  object 
 5   depth       53940 non-null  float64
 6   table       53940 non-null  float64
 7   price       53940 non-null  int64  
 8   x           53940 non-null  float64
 9   y           53940 non-null  float64
 10  z           53940 non-null  float64
dtypes: float64(6), int64(2), object(3)
memory usage: 4.5+ MB

The data is classified into two types: categorical and continuous.

Categorical data: cut, color, and clarity.

Continuous data: carat, depth, table, price, x, y, and z.
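This split can also be read straight off the column dtypes; a minimal sketch using `select_dtypes` on a stand-in frame (illustrative values):

```python
import pandas as pd

# Stand-in row mirroring the diamonds columns (illustrative values)
diamonds = pd.DataFrame({
    "carat": [0.23], "cut": ["Ideal"], "color": ["E"], "clarity": ["SI2"],
    "depth": [61.5], "table": [55.0], "price": [326],
})

# String columns are object dtype; everything else here is numeric
categorical = diamonds.select_dtypes(include="object").columns.tolist()
continuous = diamonds.select_dtypes(include="number").columns.tolist()
```
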

i)Categorical Data¶

In [20]:
cut=diamonds.cut.value_counts()
cut
Out[20]:
Ideal        21551
Premium      13791
Very Good    12082
Good          4906
Fair          1610
Name: cut, dtype: int64
In [21]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
f, ax = plt.subplots(figsize=(25,10))
ax.pie(cut, labels=cut.keys(), autopct='%1.1f%%')
ax.legend(labels=cut.keys(), loc=2)
font1 = {'family':'serif','color':'black','size':20}
plt.title("Types Of Cut's Percentages", fontdict = font1)
Out[21]:
Text(0.5, 1.0, "Types Of Cut's Percentages")

Ideal has the largest share of all the cuts and Fair the smallest, so the distribution of cut types is far from uniform.

In [22]:
df = sns.load_dataset("diamonds")
sns.barplot(data=df, x="cut", y="price")
font1 = {'family':'serif','color':'black','size':20}
plt.title("Comparison Between Cut and Price", fontdict = font1)
Out[22]:
Text(0.5, 1.0, 'Comparison Between Cut and Price')

It is normal to see the Premium cut with the highest average price, but it is surprising that the Fair cut comes second after Premium even though Fair is the worst cut type, and that Ideal averages lower than both Very Good and Good.

In [23]:
color=diamonds.color.value_counts()
color
Out[23]:
G    11292
E     9797
F     9542
H     8304
D     6775
I     5422
J     2808
Name: color, dtype: int64
In [24]:
f, ax = plt.subplots(figsize=(25,10))
ax.pie(color, labels=color.keys(), autopct='%1.1f%%')
ax.legend(labels=color.keys(), loc=2)
font1 = {'family':'serif','color':'black','size':20}
plt.title("Types Of color's Percentages", fontdict = font1)
Out[24]:
Text(0.5, 1.0, "Types Of color's Percentages")

J is the worst color and has the lowest share, while D is the best color and has only an average share. G, a mid-grade color, has the highest share, and E is second highest in share.

In [25]:
df = sns.load_dataset("diamonds")
sns.barplot(data=df, x="color", y="price")
font1 = {'family':'serif','color':'black','size':20}
plt.title("Comparison Between color and Price", fontdict = font1)
Out[25]:
Text(0.5, 1.0, 'Comparison Between color and Price')

J and I are the worst color grades but have the highest average prices, while D and E are the best grades and have the lowest.

In [26]:
clarity = diamonds.clarity.value_counts()
clarity
Out[26]:
SI1     13065
VS2     12258
SI2      9194
VS1      8171
VVS2     5066
VVS1     3655
IF       1790
I1        741
Name: clarity, dtype: int64
In [27]:
f, ax = plt.subplots(figsize=(25,10))
ax.pie(clarity, labels=clarity.keys(), autopct='%1.1f%%')
ax.legend(labels=clarity.keys(), loc=2)
font1 = {'family':'serif','color':'black','size':20}
plt.title("Types Of clarity's Percentages", fontdict = font1)
Out[27]:
Text(0.5, 1.0, "Types Of clarity's Percentages")

IF (Internally Flawless) has the second-lowest share at 3.3%, while I1 (Included) has the lowest at 1.4%. VS2 (Very Slightly Included) and SI1 (Slightly Included), the mid-range clarity grades, make up the largest shares.

In [28]:
df = sns.load_dataset("diamonds")
sns.barplot(data=df, x="clarity", y="price")
font1 = {'family':'serif','color':'black','size':20}
plt.title("Comparison Between clarity and Price", fontdict = font1)
Out[28]:
Text(0.5, 1.0, 'Comparison Between clarity and Price')

The Slightly Included grades have the highest average price, while Internally Flawless, which should command the highest price, has the second lowest.

It seems the data contains outliers within specific categories.

For example, in color, D should have the highest price, but the contrary happened.

In clarity, Internally Flawless should have the highest price, but in this dataset IF has the second-lowest.

In cut, Fair has the second-highest price.

In [29]:
clarity_cut_table = pd.crosstab(index=diamonds["clarity"], columns=diamonds["cut"])

clarity_cut_table.plot(kind="bar", 
                 figsize=(10,10),
                 stacked=True)
font1 = {'family':'serif','color':'black','size':20}
plt.title("Clarity vs Cut", fontdict = font1)
Out[29]:
Text(0.5, 1.0, 'Clarity vs Cut')

You can see that most buyers choose diamonds of SI1 clarity, followed by VS2, SI2, and VS1, and that the cuts they prefer are Ideal, Premium, and Very Good. Buyers are not taking the highest-clarity diamonds such as IF or VVS1; they are ready to sacrifice clarity and focus instead on the cut of the diamond.

In [30]:
cut_clarity_table = pd.crosstab(index=diamonds["cut"], columns=diamonds["clarity"])

cut_clarity_table.plot(kind="bar", 
                 figsize=(10,10),
                 stacked=True)
font1 = {'family':'serif','color':'black','size':20}
plt.title("Cut vs Clarity", fontdict = font1)
Out[30]:
Text(0.5, 1.0, 'Cut vs Clarity')

People prefer the Ideal cut over any other, followed by Premium and Very Good.

People focus more on cut than on clarity.

In [31]:
color_clarity_table = pd.crosstab(index=diamonds["color"], columns=diamonds["clarity"])

color_clarity_table.plot(kind="bar", 
                 figsize=(8,9),
                 stacked=True)
font1 = {'family':'serif','color':'black','size':20}
plt.title("Color vs Clarity", fontdict = font1)
Out[31]:
Text(0.5, 1.0, 'Color vs Clarity')

People prefer the G color, followed by E, F, and H.

The clarity they mostly prefer is SI1.

From the plots above, carat appears to have the highest importance for predicting the price of the diamonds, followed by cut, color, and clarity.

Converting Categorical values¶

In [32]:
diamonds['clarity']=diamonds['clarity'].replace(['IF','VVS1','VVS2','VS1','VS2','SI1','SI2','I1'],[8,7,6,5,4,3,2,1])
diamonds['color'] = diamonds['color'].replace(['D','E','F','G','H','I','J'],[7,6,5,4,3,2,1])
diamonds['cut'] = diamonds['cut'].replace(['Ideal','Premium','Very Good','Good','Fair'],[5,4,3,2,1])

diamonds.head()
Out[32]:
Unnamed: 0 carat cut color clarity depth table price x y z
0 1 0.23 5 6 2 61.5 55.0 326 3.95 3.98 2.43
1 2 0.21 4 6 3 59.8 61.0 326 3.89 3.84 2.31
2 3 0.23 2 6 5 56.9 65.0 327 4.05 4.07 2.31
3 4 0.29 4 2 4 62.4 58.0 334 4.20 4.23 2.63
4 5 0.31 2 1 2 63.3 58.0 335 4.34 4.35 2.75

Here I replaced every categorical value with a numerical one, assuming that the best grade in each category takes the highest number.
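The same mapping can be expressed with an ordered categorical, which avoids listing the codes by hand; a sketch for the cut column (grades ordered worst to best, so Fair becomes 1 and Ideal 5, matching the replacement above):

```python
import pandas as pd

cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]  # worst -> best
s = pd.Series(["Ideal", "Fair", "Premium"])

# .codes numbers the categories 0..4 in the given order; +1 shifts them to 1..5
codes = pd.Categorical(s, categories=cut_order, ordered=True).codes + 1
```
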

ii) Continuous Data¶

The first column in the dataset is a useless row index, so I will drop it.

In [33]:
diamonds.drop('Unnamed: 0',inplace=True,axis=1)
In [34]:
diamonds.head()
Out[34]:
carat cut color clarity depth table price x y z
0 0.23 5 6 2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 4 6 3 59.8 61.0 326 3.89 3.84 2.31
2 0.23 2 6 5 56.9 65.0 327 4.05 4.07 2.31
3 0.29 4 2 4 62.4 58.0 334 4.20 4.23 2.63
4 0.31 2 1 2 63.3 58.0 335 4.34 4.35 2.75
In [35]:
diamonds.describe()
Out[35]:
carat cut color clarity depth table price x y z
count 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000
mean 0.797940 3.904097 4.405803 4.051020 61.749405 57.457184 3932.799722 5.731157 5.734526 3.538734
std 0.474011 1.116600 1.701105 1.647136 1.432621 2.234491 3989.439738 1.121761 1.142135 0.705699
min 0.200000 1.000000 1.000000 1.000000 43.000000 43.000000 326.000000 0.000000 0.000000 0.000000
25% 0.400000 3.000000 3.000000 3.000000 61.000000 56.000000 950.000000 4.710000 4.720000 2.910000
50% 0.700000 4.000000 4.000000 4.000000 61.800000 57.000000 2401.000000 5.700000 5.710000 3.530000
75% 1.040000 5.000000 6.000000 5.000000 62.500000 59.000000 5324.250000 6.540000 6.540000 4.040000
max 5.010000 5.000000 7.000000 8.000000 79.000000 95.000000 18823.000000 10.740000 58.900000 31.800000

You can see that there are 0 values in the columns x, y, and z.

That means some diamonds have no dimensions, so I will eliminate those rows.

In [36]:
#Dropping dimensionless features
diamonds = diamonds.drop(diamonds[diamonds['x'] == 0].index)
diamonds = diamonds.drop(diamonds[diamonds['y'] == 0].index)
diamonds = diamonds.drop(diamonds[diamonds['z'] == 0].index)
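The three drops above can also be written as a single boolean filter; a sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    "x": [3.95, 0.00, 4.05],
    "y": [3.98, 4.00, 4.07],
    "z": [2.43, 0.00, 2.31],
})

# Keep only rows where all three dimensions are non-zero
df = df[(df[["x", "y", "z"]] != 0).all(axis=1)]
```
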
In [37]:
diamonds.describe()
Out[37]:
carat cut color clarity depth table price x y z
count 53920.000000 53920.000000 53920.000000 53920.000000 53920.000000 53920.000000 53920.000000 53920.000000 53920.000000 53920.000000
mean 0.797698 3.904228 4.405972 4.051502 61.749514 57.456834 3930.993231 5.731627 5.734887 3.540046
std 0.473795 1.116579 1.701272 1.647005 1.432331 2.234064 3987.280446 1.119423 1.140126 0.702530
min 0.200000 1.000000 1.000000 1.000000 43.000000 43.000000 326.000000 3.730000 3.680000 1.070000
25% 0.400000 3.000000 3.000000 3.000000 61.000000 56.000000 949.000000 4.710000 4.720000 2.910000
50% 0.700000 4.000000 4.000000 4.000000 61.800000 57.000000 2401.000000 5.700000 5.710000 3.530000
75% 1.040000 5.000000 6.000000 5.000000 62.500000 59.000000 5323.250000 6.540000 6.540000 4.040000
max 5.010000 5.000000 7.000000 8.000000 79.000000 95.000000 18823.000000 10.740000 58.900000 31.800000

Checking for Outliers¶

In [139]:
ax = sns.pairplot(diamonds, hue= "cut", palette = "Spectral")

A closer look at the continuous features against price

In [38]:
sns.set_palette("afmhot")
cols = ['carat','x','y','z','table','depth']
c = 0
fig, axs = plt.subplots(ncols = len(cols), figsize=(20,7))
for i in cols :
    sns.scatterplot(data = diamonds,x = diamonds['price'],y = diamonds[i], ax = axs[c])
    c+=1
In [39]:
diamonds = diamonds[(diamonds['y'] < 30)]
diamonds = diamonds[(diamonds['z'] < 30) & (diamonds['z'] > 2)]
diamonds = diamonds[(diamonds['table'] < 80) & (diamonds['table'] > 40)]
diamonds = diamonds[(diamonds['depth'] < 75) & (diamonds['depth'] > 45)]

As you can see, there are no extreme outliers in carat or x against price, but there are outliers in y, z, table, and depth, so I removed those outliers according to the scatter plots.
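The visual thresholds used above could also be cross-checked against a rule-based filter such as the 1.5 × IQR rule; a sketch on a toy series with one extreme value (like the z column):

```python
import pandas as pd

s = pd.Series([3.5, 3.6, 3.7, 3.5, 31.8])  # one extreme value, as in z

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# Points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
inliers = s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
```
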

To make sure that I removed the outliers:

In [40]:
diamonds.shape
Out[40]:
(53907, 10)
In [41]:
sns.set_palette("afmhot")
cols = ['y','z','table','depth']
c = 0
fig, axs = plt.subplots(ncols = len(cols), figsize=(20,7))
for i in cols :
    sns.scatterplot(data = diamonds,x = diamonds['price'],y = diamonds[i], ax = axs[c])
    c+=1
In [42]:
#Examining correlation matrix using heatmap
cmap = sns.diverging_palette(205, 133, 63, as_cmap=True)
cols = (["#682F2F", "#9E726F", "#D6B2B1", "#B9C0C9", "#9F8A78", "#F3AB60"])
corrmat= diamonds.corr()
f, ax = plt.subplots(figsize=(15,12))
sns.heatmap(corrmat,cmap=cols,annot=True)
Out[42]:
<AxesSubplot:>

It seems that there is a lot of correlation.

But it makes sense: x · y · z gives the volume, and carat depends on the volume, as shown in the plot.

So I will introduce a new column in the dataset: volume = x · y · z.

In [43]:
diamonds["volume"] = diamonds.x * diamonds.y * diamonds.z
diamonds["volume"].head()
Out[43]:
0    38.202030
1    34.505856
2    38.076885
3    46.724580
4    51.917250
Name: volume, dtype: float64
In [44]:
diamonds.head()
Out[44]:
carat cut color clarity depth table price x y z volume
0 0.23 5 6 2 61.5 55.0 326 3.95 3.98 2.43 38.202030
1 0.21 4 6 3 59.8 61.0 326 3.89 3.84 2.31 34.505856
2 0.23 2 6 5 56.9 65.0 327 4.05 4.07 2.31 38.076885
3 0.29 4 2 4 62.4 58.0 334 4.20 4.23 2.63 46.724580
4 0.31 2 1 2 63.3 58.0 335 4.34 4.35 2.75 51.917250
In [45]:
diamonds = diamonds.drop(columns={"x","y","z"})
In [46]:
diamonds.head()
Out[46]:
carat cut color clarity depth table price volume
0 0.23 5 6 2 61.5 55.0 326 38.202030
1 0.21 4 6 3 59.8 61.0 326 34.505856
2 0.23 2 6 5 56.9 65.0 327 38.076885
3 0.29 4 2 4 62.4 58.0 334 46.724580
4 0.31 2 1 2 63.3 58.0 335 51.917250
In [47]:
#Examining correlation matrix using heatmap
cmap = sns.diverging_palette(205, 133, 63, as_cmap=True)
cols = (["#682F2F", "#9E726F", "#D6B2B1", "#B9C0C9", "#9F8A78", "#F3AB60"])
corrmat= diamonds.corr()
f, ax = plt.subplots(figsize=(15,12))
sns.heatmap(corrmat,cmap=cols,annot=True)
Out[47]:
<AxesSubplot:>

As seen in the heatmap, the price correlates strongly with both the volume and the carat.

Prepare The Data To The ML Models¶

Check whether the price distribution needs to be normalized.

In [48]:
skY=diamonds['price'].skew()
skY
Out[48]:
1.6186409761621152

A skew of 1.618 means the price distribution is not normal; a skew near zero would indicate an approximately normal distribution.
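The effect of the log transform on a right-skewed variable can be seen on synthetic log-normal data (the family of distributions the raw prices resemble); a sketch with assumed parameters:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Log-normal samples: heavily right-skewed, like the raw prices
price = pd.Series(np.exp(rng.normal(7.8, 1.0, size=5000)))

raw_skew = price.skew()          # large and positive
log_skew = np.log(price).skew()  # close to zero
```
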

In [49]:
from sklearn.preprocessing import MinMaxScaler,StandardScaler,PolynomialFeatures
from sklearn import preprocessing
plt.figure(figsize=[15,5])
plt.subplot(1,2,1)
plt.hist(diamonds['price'], bins=50, ec='black', color='#2196f3')
plt.xlabel('Price in thousands')
plt.ylabel('Number of Diamonds')
plt.title(f'Before Log transformation, Skew:{round(skY,3)}')

plt.subplot(1,2,2)
Y = np.log(diamonds['price'])
sk = Y.skew()  # skew of the log-transformed price
plt.hist(Y, bins=50, ec='black', color='#2196f3')
plt.xlabel('Price in logs')
plt.ylabel('Number of Diamonds')
plt.title(f'After Log transformation, Skew:{round(sk,3)}')
plt.show()

After the log transformation the price distribution is close to normal. (Note that the models below are still fit on the raw price, so their predictions are already in dollars.)

Check Whether the Rest of the Data Is Scaled¶

In [50]:
X_notdum = diamonds
figure = plt.figure(figsize=(15,10))
for n, col in enumerate(X_notdum.columns):
    ax = figure.add_subplot(3,4,n+1)
    ax.set_title(col)
    X_notdum[col].hist(ax=ax, bins=50)
    
figure.tight_layout() # keep the subplots from overlapping
plt.show()
In [51]:
x=diamonds.drop(["price"],axis =1)

From the histograms above, the features sit on very different scales, so X needs to be scaled.

In [52]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()

Xss = sc_X.fit_transform(x)

train_SS = pd.DataFrame(Xss, columns=['carat', 'cut', 'color', 'clarity', 'depth', 'table',
       'volume'])

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15, 5))
ax1.set_title('Before Scaling')
for e in x.columns:
    sns.kdeplot(x[e], ax=ax1)
ax2.set_title('After Standard Scaling')
for e in train_SS.columns:
    sns.kdeplot(train_SS[e], ax=ax2, legend=None)
plt.show()
In [53]:
from sklearn.preprocessing import MinMaxScaler
mmc_X = MinMaxScaler()

Xmm = mmc_X.fit_transform(x)

train_MM = pd.DataFrame(Xmm, columns=['carat', 'cut', 'color', 'clarity', 'depth', 'table',
       'volume'])

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15, 5))
ax1.set_title('Before Scaling')
for e in x.columns:
    sns.kdeplot(x[e], ax=ax1)
ax2.set_title('After Min-Max Scaling')
for e in train_MM.columns:
    sns.kdeplot(train_MM[e], ax=ax2, legend=None)
plt.show()

ML MODELS¶

In [54]:
# Defining the independent and dependent variables
x=diamonds.drop(["price"],axis =1)
y= diamonds["price"]
In [55]:
x
Out[55]:
carat cut color clarity depth table volume
0 0.23 5 6 2 61.5 55.0 38.202030
1 0.21 4 6 3 59.8 61.0 34.505856
2 0.23 2 6 5 56.9 65.0 38.076885
3 0.29 4 2 4 62.4 58.0 46.724580
4 0.31 2 1 2 63.3 58.0 51.917250
... ... ... ... ... ... ... ...
53935 0.72 5 7 3 60.8 57.0 115.920000
53936 0.72 2 7 3 63.1 55.0 118.110175
53937 0.70 3 7 3 62.8 60.0 114.449728
53938 0.86 4 3 2 61.0 58.0 140.766120
53939 0.75 5 7 2 62.2 55.0 124.568444

53907 rows × 7 columns

In [56]:
y
Out[56]:
0         326
1         326
2         327
3         334
4         335
         ... 
53935    2757
53936    2757
53937    2757
53938    2757
53939    2757
Name: price, Length: 53907, dtype: int64
In [60]:
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(x, y,test_size=0.20, random_state=25)
In [61]:
#libraries
import pandas as pd
from pandas import DataFrame,Series
import matplotlib.pyplot as plt
import numpy as np
import warnings
import seaborn as sns
sns.set(context="notebook", palette="Spectral", style = 'darkgrid' ,font_scale = 1.5, color_codes=True)
from sklearn.preprocessing import MinMaxScaler,StandardScaler,PolynomialFeatures
from sklearn import preprocessing

from sklearn.linear_model import LinearRegression,Ridge,Lasso, ElasticNet,SGDRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV , KFold , cross_val_score

from sklearn.metrics import mean_squared_log_error,mean_squared_error, r2_score,mean_absolute_error 
from sklearn.pipeline import Pipeline

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

import os
In [62]:
#Show the results of the split
print("Training set has {} samples.".format(x_train.shape[0]))
print("Testing set has {} samples.".format(x_val.shape[0]))
Training set has 43125 samples.
Testing set has 10782 samples.

Linear Model¶

A linear regression model provides a sloped straight line representing the relationship between the variables. Its goal is to find the best-fit line, meaning the error between predicted and actual values should be minimized; the best-fit line has the least error.
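The least-squares idea can be seen on a tiny example; a sketch using `np.polyfit` (degree 1 = straight line) on points that lie exactly on a line:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0  # points exactly on the line y = 2x + 1

# Least squares recovers the slope and intercept with (near) zero error here
slope, intercept = np.polyfit(x, y, deg=1)
```
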

In [76]:
dlin = LinearRegression()
dlin.fit(x_train, y_train)
dlin_pred = dlin.predict(x_val)
print('####### Linear Regression #######')
print('Score : %.4f' % dlin.score(x_val, y_val))
dlin_r2 = dlin.score(x_val, y_val)
dlin_mse = mean_squared_error(y_val, dlin_pred, squared=False)  # squared=False returns the RMSE
print('')
print('RMSE   : %0.2f ' % dlin_mse)
print('R2     : %0.2f ' % dlin_r2)
n=x_val.shape[0]
p=x_val.shape[1]
adj_rsquared = 1 - (1 - dlin_r2) * ((n - 1)/(n-p-1))
print('Adjusted R Squared: {}'.format(adj_rsquared))
####### Linear Regression #######
Score : 0.9072

RMSE   : 1219.40 
R2     : 0.91 
Adjusted R Squared: 0.9071803954289116
In [78]:
plt.figure(figsize=(7,7))     
sns.regplot(x=y_val, y=dlin_pred, fit_reg=True)
Out[78]:
<AxesSubplot:xlabel='price'>

NonLinear Models¶

Decision Tree

As its name suggests, a decision tree asks a series of questions that narrow down the information until it reaches the answer you want. It consists of nodes: the main node, called the root node, holds most of the information, and it is broken down into smaller nodes that carry less information.
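The "questions" a regression tree asks are threshold splits; a minimal sketch of choosing one split so that the two leaf means have the smallest summed squared error (toy data, not the diamonds):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 5.0, 50.0, 52.0, 51.0])

def sse(v):
    # Squared error of predicting a leaf by its mean value
    return float(((v - v.mean()) ** 2).sum()) if len(v) else 0.0

# Try every candidate threshold; keep the one with the lowest total error
threshold = min(x[:-1], key=lambda t: sse(y[x <= t]) + sse(y[x > t]))
```
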

In [74]:
from sklearn.tree import DecisionTreeRegressor
dtm = DecisionTreeRegressor(min_samples_split=40, max_features="auto")
dtm.fit(x_train, y_train) 
dtm_pred = dtm.predict(x_val)
dtm_r2 = dtm.score(x_val, y_val)
dtm_mse = mean_squared_error(y_val, dtm_pred, squared=False)  # RMSE
print('####### Decision Tree Regressor #######')
print('Score : %.4f' % dtm.score(x_val, y_val))
print('')
print('RMSE   : %0.2f ' % dtm_mse)
print('R2     : %0.2f ' % dtm_r2)
n=x_val.shape[0]
p=x_val.shape[1]
adj_rsquared = 1 - (1 - dtm_r2) * ((n - 1)/(n-p-1))
print('Adjusted R Squared: {}'.format(adj_rsquared))
####### Decision Tree Regressor #######
Score : 0.9758

RMSE   : 622.66 
R2     : 0.98 
Adjusted R Squared: 0.9757978613237582
In [81]:
plt.figure(figsize=(7,7))     
sns.regplot(x=y_val, y=dtm_pred, fit_reg=True)
Out[81]:
<AxesSubplot:xlabel='price'>

Random Forest

A random forest creates a set of decision trees, each from a randomly selected subset of the training set, and then aggregates them to decide the final prediction: classification forests collect the trees' votes, while the regressor used here averages their outputs.

In [84]:
from sklearn.ensemble import RandomForestRegressor
rfm = RandomForestRegressor(n_estimators=500 ,min_samples_split=40, max_features="auto", min_samples_leaf=1, bootstrap=True)
rfm.fit(x_train, y_train) 
rfm_pred = rfm.predict(x_val)
rfm_r2 = rfm.score(x_val, y_val)
rfm_mse = mean_squared_error(y_val, rfm_pred, squared=False)  # RMSE
print('####### Random Forest Regressor #######')
print('Score : %.4f' % rfm.score(x_val, y_val))
print('')
print('RMSE   : %0.2f ' % rfm_mse)
print('R2     : %0.2f ' % rfm_r2)
n=x_val.shape[0]
p=x_val.shape[1]
adj_rsquared = 1 - (1 - rfm_r2) * ((n - 1)/(n-p-1))
print('Adjusted R Squared: {}'.format(adj_rsquared))
####### Random Forest Regressor #######
Score : 0.9798

RMSE   : 568.63 
R2     : 0.98 
Adjusted R Squared: 0.979815779756908
In [86]:
plt.figure(figsize=(7,7))     
sns.regplot(x=y_val, y=rfm_pred, fit_reg=True)
Out[86]:
<AxesSubplot:xlabel='price'>

Support Vector Regressor

A support vector machine fits a best-fit hyperplane to the data you provide; the classifier uses it to divide the categories, while the regression variant used here (SVR) predicts a continuous value from it.
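What distinguishes SVR from ordinary least squares is its epsilon-insensitive loss: residuals inside an epsilon-wide tube around the fit cost nothing. A sketch of that loss (the epsilon value here is an assumption for illustration):

```python
import numpy as np

def eps_insensitive(residuals, eps=0.1):
    # Zero penalty inside the tube, linear penalty outside it
    return np.maximum(np.abs(residuals) - eps, 0.0)

losses = eps_insensitive(np.array([0.05, -0.08, 0.50, -1.10]))
```
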

In [88]:
from sklearn.svm import SVR
svrm = SVR(C=1000)
svrm.fit(x_train, y_train) 
svrm_pred = svrm.predict(x_val)
svrm_r2 = svrm.score(x_val, y_val)
svrm_mse = mean_squared_error(y_val, svrm_pred, squared=False)  # RMSE
print('####### Support Vector Regressor #######')
print('Score : %.4f' % svrm.score(x_val, y_val))
print('')
print('RMSE   : %0.2f ' % svrm_mse)
print('R2     : %0.2f ' % svrm_r2)
n=x_val.shape[0]
p=x_val.shape[1]
adj_rsquared = 1 - (1 - svrm_r2) * ((n - 1)/(n-p-1))
print('Adjusted R Squared: {}'.format(adj_rsquared))
####### Support Vector Regressor #######
Score : 0.9401

RMSE   : 979.67 
R2     : 0.94 
Adjusted R Squared: 0.9400891173589735
In [89]:
plt.figure(figsize=(7,7))     
sns.regplot(x=y_val, y=svrm_pred, fit_reg=True)
Out[89]:
<AxesSubplot:xlabel='price'>
Bagging Regressor

Bagging regressors are similar to bagging classifiers: they train each base regressor on a random subset of the original training set and aggregate the predictions. Because the target variable is numeric, the aggregation averages the base models' outputs.
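The two ingredients named above, bootstrap sampling and averaging, can be sketched directly in NumPy (the per-model predictions are hypothetical values):

```python
import numpy as np

rng = np.random.default_rng(0)
train_idx = np.arange(1000)

# Each base regressor trains on a bootstrap sample: same size, drawn with replacement
boot = rng.choice(train_idx, size=train_idx.size, replace=True)

# For a numeric target, the ensemble averages the base models' predictions
base_preds = np.array([2700.0, 2750.0, 2800.0])  # hypothetical outputs of 3 models
ensemble_pred = base_preds.mean()
```

Drawing with replacement means a bootstrap sample usually repeats some rows and omits others, which is what gives each base model a slightly different view of the data.
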

In [95]:
from sklearn.ensemble import BaggingRegressor
bgm = BaggingRegressor(n_estimators=500, max_samples=30000, bootstrap=True, bootstrap_features=False)
bgm.fit(x_train, y_train) 
bgm_pred = bgm.predict(x_val)
bgm_r2 = bgm.score(x_val, y_val)
bgm_mse = mean_squared_error(y_val, bgm_pred, squared=False)  # RMSE
print('####### Bagging Regressor #######')
print('Score : %.4f' % bgm.score(x_val, y_val))
print('')
print('RMSE   : %0.2f ' % bgm_mse)
print('R2     : %0.2f ' % bgm_r2)
n=x_val.shape[0]
p=x_val.shape[1]
adj_rsquared = 1 - (1 - bgm_r2) * ((n - 1)/(n-p-1))
print('Adjusted R Squared: {}'.format(adj_rsquared))
####### Bagging Regressor #######
Score : 0.9815

RMSE   : 544.82 
R2     : 0.98 
Adjusted R Squared: 0.9814708430625516
In [96]:
plt.figure(figsize=(7,7))     
sns.regplot(x=y_val, y=bgm_pred, fit_reg=True)
Out[96]:
<AxesSubplot:xlabel='price'>
MultiLayer Perceptron Regressor

MLPRegressor is an artificial neural network model that uses backpropagation to adjust the weights between neurons in order to improve prediction accuracy. It implements a multi-layer perceptron trained with backpropagation and stochastic gradient descent methods.
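A forward pass of such a network is just matrix products with a nonlinearity between them; a minimal sketch of one ReLU hidden layer of 300 units over the 7 features used in this notebook (random weights, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 7))            # one sample, 7 features

W1 = rng.normal(size=(7, 300)) * 0.1   # input -> hidden weights
b1 = np.zeros(300)
W2 = rng.normal(size=(300, 1)) * 0.1   # hidden -> output weights
b2 = np.zeros(1)

h = np.maximum(x @ W1 + b1, 0.0)       # ReLU activation, as in the model above
y_hat = h @ W2 + b2                    # linear output unit for regression
```

Backpropagation then nudges W1, b1, W2, b2 in the direction that reduces the squared error of y_hat.
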

In [98]:
from sklearn.neural_network import MLPRegressor
mlpreg = MLPRegressor(hidden_layer_sizes=(300, ), activation='relu', solver='adam', alpha=1000, batch_size='auto', max_iter=30000, shuffle=False, random_state=None)
mlpreg.fit(x_train, y_train) 
mlp_pred = mlpreg.predict(x_val)
mlp_r2 = mlpreg.score(x_val, y_val)
mlp_mse = mean_squared_error(y_val, mlp_pred, squared=False)  # RMSE
print('####### MLPRegressor #######')
print('Score : %.4f' % mlpreg.score(x_val, y_val))
print('')
print('RMSE   : %0.2f ' % mlp_mse)
print('R2     : %0.2f ' % mlp_r2)
n=x_val.shape[0]
p=x_val.shape[1]
adj_rsquared = 1 - (1 - mlp_r2) * ((n - 1)/(n-p-1))
print('Adjusted R Squared: {}'.format(adj_rsquared))
####### MLPRegressor #######
Score : 0.9753

RMSE   : 628.67 
R2     : 0.98 
Adjusted R Squared: 0.9753286864961588
In [101]:
pd.DataFrame({
              'R-Squared': [dlin_r2, dtm_r2, rfm_r2, svrm_r2,bgm_r2,mlp_r2],
             'MSE': [dlin_mse, dtm_mse, rfm_mse, svrm_mse ,bgm_mse,mlp_mse], 
            
             }, 
            index=['Linear Regression', 'DecissionTree Classifier', 'RandomForest Classifier','Support Vector Classifier','Bagging Regressor','MultiLayer Perceptron Regressor'])
Out[101]:
R-Squared MSE
Linear Regression 0.907241 1219.395654
DecissionTree Classifier 0.975814 1219.395654
RandomForest Classifier 0.979829 568.631317
Support Vector Classifier 0.940128 979.665118
Bagging Regressor 0.981483 544.819465
MultiLayer Perceptron Regressor 0.975345 628.667373

Best Model¶

The Bagging Regressor is the best model: it has the highest R² score (0.98) of all the models, and its RMSE is the smallest, as you can see in the table above.

Testing The Best Model¶

In [123]:
# import the test data and inspect it
testData = pd.read_csv('diamonds_test.csv')
testData.head(15)
Out[123]:
Unnamed: 0 carat cut color clarity depth table x y z
0 0 0.30 Ideal H SI2 60.0 56.0 4.41 4.43 2.65
1 1 0.34 Ideal D IF 62.1 57.0 4.52 4.46 2.79
2 2 1.57 Very Good I VS2 60.3 58.0 7.58 7.55 4.56
3 3 0.31 Ideal H VS2 61.8 57.0 4.32 4.36 2.68
4 4 1.51 Good I VVS1 64.0 60.0 7.26 7.21 4.63
5 5 0.70 Very Good E SI1 59.6 63.0 5.72 5.65 3.39
6 6 0.51 Premium F SI2 58.3 61.0 5.18 5.14 3.01
7 7 1.55 Very Good I VS1 59.0 58.0 7.56 7.63 4.48
8 8 0.41 Ideal D SI1 62.2 57.0 4.76 4.70 2.94
9 9 0.30 Very Good H VS2 62.5 58.0 4.26 4.28 2.67
10 10 1.23 Very Good G VVS1 61.3 57.0 6.88 6.96 4.24
11 11 2.54 Very Good H SI2 63.5 56.0 8.68 8.65 5.50
12 12 0.90 Premium E SI1 59.8 58.0 6.26 6.21 3.73
13 13 0.90 Good E SI1 62.2 65.0 6.13 6.08 3.80
14 14 0.76 Very Good F VS2 62.0 58.0 5.80 5.86 3.62
In [124]:
#Dropping dimensionless features
testData = testData.drop(testData[testData['x'] == 0].index)
testData = testData.drop(testData[testData['y'] == 0].index)
testData = testData.drop(testData[testData['z'] == 0].index)
In [125]:
testData["volume"] = testData.x * testData.y * testData.z
In [126]:
testData.head()
Out[126]:
Unnamed: 0 carat cut color clarity depth table x y z volume
0 0 0.30 Ideal H SI2 60.0 56.0 4.41 4.43 2.65 51.771195
1 1 0.34 Ideal D IF 62.1 57.0 4.52 4.46 2.79 56.244168
2 2 1.57 Very Good I VS2 60.3 58.0 7.58 7.55 4.56 260.964240
3 3 0.31 Ideal H VS2 61.8 57.0 4.32 4.36 2.68 50.478336
4 4 1.51 Good I VVS1 64.0 60.0 7.26 7.21 4.63 242.355498
In [127]:
testData = testData.drop(columns={"x","y","z"})
In [128]:
testData.drop('Unnamed: 0',inplace=True,axis=1)
In [132]:
def test(testData):

  # Data Transformation
  testData['clarity']=testData['clarity'].replace(['IF','VVS1','VVS2','VS1','VS2','SI1','SI2','I1'],[8,7,6,5,4,3,2,1])
  testData['color'] = testData['color'].replace(['D','E','F','G','H','I','J'],[7,6,5,4,3,2,1])
  testData['cut'] = testData['cut'].replace(['Ideal','Premium','Very Good','Good','Fair'],[5,4,3,2,1])
  X=DataFrame(testData,columns =['carat','cut','color','clarity','depth','table','volume'])
  Y= bgm.predict(X)
  
  # The model was fit on the raw price, not the log price, so the prediction
  # is already in dollars; the RMSE gives a rough +/- band around it
  upper_bound = Y + bgm_mse
  lower_bound = Y - bgm_mse
  print(f'The price predicted by our model is {round(Y[0],2)}')
  print(f'The price predicted by our model is in range between {round(lower_bound[0],2)} \
and {round(upper_bound[0],2)}')
In [137]:
testData1=pd.DataFrame({'carat': 0.7,
                                    'clarity': 'SI1' ,
                                    'cut': 'Very Good', 
                                    'color': 'E',
                                    'table': 63,
                                    'depth' : 59.6,
                                    'volume' : 109.55
                                    }, index =[0])
In [138]:
test(testData1)